Detecting Motifs in a Large Data Set: Applying Probabilistic Insights to Motif Finding

Authors

  • Christina Boucher
  • Daniel G. Brown
Abstract

We give a probabilistic algorithm for Consensus Sequence, an NP-complete subproblem of motif recognition, that can be described as follows: given a set of l-length sequences, determine whether there exists a sequence that has Hamming distance at most d from every sequence in the set. We demonstrate that the distance between a randomly selected majority sequence and a consensus sequence decreases as the size of the data set increases. Applying our probabilistic paradigms and insights to motif recognition, we develop pMCL-WMR, a program capable of detecting motifs in large synthetic and real genomic data sets. Our results show that detecting motifs becomes easier and more efficient as the number of sequences in the data set increases, a surprising and counterintuitive fact that has significant impact on this deeply investigated area.
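The definitions in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm and not pMCL-WMR. The function names, the DNA alphabet, the column-wise reading of "majority sequence" with random tie-breaking, and the toy input are all assumptions made for illustration, and the exhaustive search is exponential in l, so it is only suitable for tiny inputs.

```python
import random
from collections import Counter
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def is_consensus(candidate, sequences, d):
    """True if candidate is within Hamming distance d of every sequence."""
    return all(hamming(candidate, s) <= d for s in sequences)


def majority_sequence(sequences, rng=random):
    """Pick a most frequent symbol in each column, breaking ties at random.

    Assumption: this column-wise definition is one plausible reading of
    'randomly selected majority sequence' in the abstract.
    """
    result = []
    for column in zip(*sequences):
        counts = Counter(column)
        top = max(counts.values())
        tied = sorted(sym for sym, n in counts.items() if n == top)
        result.append(rng.choice(tied))
    return "".join(result)


def brute_force_consensus(sequences, d, alphabet="ACGT"):
    """Exhaustive search over all l-length strings (illustration only)."""
    length = len(sequences[0])
    for letters in product(alphabet, repeat=length):
        candidate = "".join(letters)
        if is_consensus(candidate, sequences, d):
            return candidate
    return None


# Hypothetical toy data: four sequences of length l = 4.
seqs = ["ACGT", "ACGA", "ACCT", "TCGT"]
maj = majority_sequence(seqs)
print(maj, is_consensus(maj, seqs, 1))
print(brute_force_consensus(seqs, 1), brute_force_consensus(seqs, 0))
```

In this toy input every column has a unique most frequent symbol, so the majority sequence is deterministic and happens to be a consensus sequence for d = 1. With d = 0 no consensus sequence exists because the inputs are not all identical.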


Similar resources

Lecture 6: The EM Algorithm, Mixture Models, and Motif

In a previous class, we discussed an algorithm for learning a probabilistic matrix model which describes a fixed-length motif in a set of sequences S_1, ..., S_n over an alphabet A. This algorithm is one of a class of methods collectively known as expectation maximization, or EM. We will describe the general EM algorithm, then derive the motif-finding algorithm by applying EM to learn a specific pr...


Identification of Transcription Factor Binding Sites in Promoter Regions by Modularity Analysis of the Motif Co-occurrence Graph

Many algorithms have been proposed to date for the problem of finding biologically significant motifs in promoter regions. They can be classified into two large families: combinatorial methods and probabilistic methods. Probabilistic methods have been used more extensively, since their output is easier to interpret. Combinatorial methods have the potential to identify hard to detect motifs, but...


MotifCut: regulatory motifs finding with maximum density subgraphs

MOTIVATION: DNA motif finding is one of the core problems in computational biology, for which several probabilistic and discrete approaches have been developed. Most existing methods formulate motif finding as an intractable optimization problem and rely either on expectation maximization (EM) or on local heuristic searches. Another challenge is the choice of motif model: simpler models such as ...


Genetic Algorithm Based Probabilistic Motif Discovery in Multiple Unaligned Biological Sequences

Many computational approaches have been introduced for the problem of motif identification in a set of biological sequences, and they are classified according to the type of motifs discovered. In this study, we propose a model to discover motifs in a large set of unaligned sequences in minimal time using a genetic-algorithm-based probabilistic motif discovery model. The proposed algorith...


Finding motifs from all sequences with and without binding sites

MOTIVATION: Finding common patterns (motifs) in a set of promoter regions of coregulated genes is an important problem in molecular biology. Most existing motif-finding algorithms consider a set of sequences bound by the transcription factor as the only input. However, we can get better results by considering sequences that are not bound by the transcription factor as an additional input. RE...




Publication date: 2009